Abstract. The classical center based clustering problems such as $k$-means/median/center assume that
the optimal clusters satisfy the locality property that the points in the same cluster are close to each
other. A number of clustering problems arise in machine learning where the optimal clusters do not
follow such a locality property. For instance, consider the $r$-gather clustering problem, where there is an
additional constraint that each of the clusters should have at least $r$ points, or the capacitated clustering
problem, where there is an upper bound on the cluster sizes. Consider a variant of the $k$-means problem
that may be regarded as a general version of such problems. Here, the optimal clusters $O_1, \ldots, O_k$ are
an arbitrary partition of the dataset and the goal is to output $k$ centers $c_1, \ldots, c_k$ such that the objective
function $\sum_{i=1}^{k} \sum_{x \in O_i} \|x - c_i\|^2$ is minimized. It is not difficult to argue that any algorithm (without
knowing the optimal clusters) that outputs a single set of $k$ centers will not behave well as far as
optimizing the above objective function is concerned. However, this does not rule out the existence
of algorithms that output a list of such $k$ centers such that at least one of these $k$ centers behaves
well. Given an error parameter $\varepsilon > 0$, let $\ell$ denote the size of the smallest list of $k$-centers such that
at least one of the $k$-centers gives a $(1 + \varepsilon)$ approximation w.r.t. the objective function above. In
this paper, we show an upper bound on $\ell$ by giving a randomized algorithm that outputs a list of
$2^{\tilde{O}(k/\varepsilon)}$ $k$-centers. We also give a closely matching lower bound of $2^{\tilde{\Omega}(k/\sqrt{\varepsilon})}$. Moreover, our algorithm
runs in time $O(nd \cdot 2^{\tilde{O}(k/\varepsilon)})$. This is a significant improvement over the previous result of Ding and
Xu [DX15], who gave an algorithm with running time $O(nd \cdot (\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)})$ and output a list of
size $O((\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)})$. Our techniques generalize for the $k$-median problem and for many other
settings where non-Euclidean distance measures are involved.


1 Introduction 

Clustering problems intend to classify high dimensional data based on the proximity of points to
each other. There is an inherent assumption that the clusters satisfy the locality property: points
close to each other (in a geometric sense) should belong to the same category. Often, we model
such problems by the notion of a center based clustering problem. We would like to identify a
set of centers, one for each cluster, and then the clustering is obtained by assigning each point to
the nearest center. For example, the $k$-means problem is defined in the following manner: given
a dataset $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ and an integer $k$, output a set of $k$ centers $\{c_1, \ldots, c_k\} \subset \mathbb{R}^d$
such that the objective function $\sum_{x \in X} \min_{c \in \{c_1, \ldots, c_k\}} \|x - c\|^2$ is minimized. The $k$-median and the
$k$-center problems are defined in a similar manner by defining a suitable objective function.
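The $k$-means objective just defined can be evaluated directly. The following is a minimal sketch in plain Python; the function names (`sq_dist`, `kmeans_cost`, `voronoi_partition`) are illustrative choices and are not taken from the paper.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points: List[Point], centers: List[Point]) -> float:
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


def voronoi_partition(points: List[Point], centers: List[Point]) -> List[List[Point]]:
    """Assign every point to its nearest center (ties go to the lowest index)."""
    clusters: List[List[Point]] = [[] for _ in centers]
    for x in points:
        nearest = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
        clusters[nearest].append(x)
    return clusters


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centers = [(0.0, 1.0), (10.0, 0.0)]
print(kmeans_cost(points, centers))
print(voronoi_partition(points, centers))
```

In this unconstrained setting the clustering is fully determined by the centers, which is exactly the property that fails once side constraints are added.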

However, often such clustering problems entail several side constraints. Such constraints limit
the set of feasible clusterings. For example, the $r$-gather $k$-means clustering problem is defined in
the same manner as the $k$-means problem, but has the additional constraint that each cluster must
have at least $r$ points in it. In such settings, it is no longer true that the clustering is obtained from
the set of centers by the Voronoi partition. Ding and Xu [DX15] began a systematic study of such
problems, and this is the starting point of our work as well. They defined the so-called constrained
$k$-means problem. An instance of such a problem is specified by a set of points $X$, a parameter $k$,
and a set $\mathbb{C}$, where each element of $\mathbb{C}$ is a partitioning of $X$ into $k$ disjoint subsets (or clusters). Since
the set $\mathbb{C}$ may be exponentially large, we will assume that it is specified in a succinct manner by
an efficient algorithm which decides membership in this set. A solution needs to output an element
$O = \{O_1, \ldots, O_k\}$ of $\mathbb{C}$, and a set of $k$ centers $c_1, \ldots, c_k$, one for each cluster in $O$. The goal is
to minimize $\sum_{i=1}^{k} \sum_{x \in O_i} \|x - c_i\|^2$. It is easy to check that the center $c_i$ must be the mean of the
corresponding cluster $O_i$. Note that the $k$-means problem is a special case of this problem where
the set $\mathbb{C}$ contains all possible ways of partitioning $X$ into $k$ subsets. The constrained $k$-median
problem can be defined similarly. We will make the natural assumption (which is made by Ding
and Xu as well) that it suffices to find a set of $k$ centers. In other words, there is an (efficient)
algorithm $A^{\mathbb{C}}$, which given a set of $k$ centers $c_1, \ldots, c_k$, outputs the clustering $\{O_1, \ldots, O_k\} \in \mathbb{C}$
such that $\sum_{i=1}^{k} \sum_{x \in O_i} \|x - c_i\|^2$ is minimized. Such an algorithm is called a partition algorithm
by Ding and Xu [DX15].¹ For the case of the $k$-means problem, this algorithm will just give the
Voronoi partition with respect to $c_1, \ldots, c_k$, whereas in the case of the $r$-gather $k$-means clustering
problem, the algorithm will be given by a suitable min-cost flow computation (see Section 4.1
in [DX15]).

Ding and Xu [DX15] considered several natural problems arising in diverse areas, e.g. machine
learning, which can be stated in this framework. These included the so-called $r$-gather $k$-means,
$r$-capacity $k$-means and $l$-diversity $k$-means problems. Their approach for solving such problems
was to output a list of candidate sets of centers (of size $k$) such that at least one of these was
close to the optimal centers. We formalize this approach and show that if $k$ is a constant, then one
can obtain a PTAS for the constrained $k$-means (and the constrained $k$-median) problems whose
running time is linear plus a constant number of calls to $A^{\mathbb{C}}$.

We define the list $k$-means problem. Given a set of points $X$ and parameters $k$ and $\varepsilon$, we want
to output a list $\mathcal{L}$ of sets of $k$ points (or centers). The list $\mathcal{L}$ should have the following property:
for any partitioning $O = \{O_1, \ldots, O_k\}$ of $X$ into $k$ clusters, there exists a set $c_1, \ldots, c_k$ in the list
$\mathcal{L}$ such that (up to reordering of these centers)

$$\sum_{i=1}^{k} \sum_{x \in O_i} \|c_i - x\|^2 \le (1 + \varepsilon) \sum_{i=1}^{k} \sum_{x \in O_i} \|x - m_i\|^2, \qquad (1)$$

where $m_i = \frac{\sum_{x \in O_i} x}{|O_i|}$ denotes the mean of $O_i$. Note that the latter quantity is the $k$-means cost of
the clustering $O$, and so we require $c_1, \ldots, c_k$ to be such that the cost of assigning to these centers
is close to the optimal $k$-means cost of this clustering. We shall use $\mathrm{opt}_k(O)$ to denote the optimal
$k$-means cost of $O$.

Although such an oblivious approach to clustering may appear too optimistic, we show that it
is possible to obtain such a list $\mathcal{L}$ of size $2^{\tilde{O}(k/\varepsilon)}$ in $O(nd \cdot 2^{\tilde{O}(k/\varepsilon)})$ time. This improves the result of
Ding and Xu [DX15], where they gave an algorithm which outputs a list of size $O((\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)})$.
Observe that we address a question which is both algorithmic and existential: how small can the
size of $\mathcal{L}$ be, and how efficiently can we find it? We also give almost matching lower bounds on
the size of such a list $\mathcal{L}$. Our algorithm for finding $\mathcal{L}$ relies on the $D^2$-sampling idea: iteratively
find the centers by picking the next one to be far from the current set of centers. Although these
ideas have been used for the $k$-means problem (see e.g. [JKS14]), they rely heavily on the fact
that given a set of centers, the corresponding clustering is obtained by the corresponding Voronoi
partition. Our approach relies on showing that there is a small sized list $\mathcal{L}$ which works well for all
possible clusterings.


¹ [DX15] also gave a discussion on such partition algorithms for a number of clustering problems with side constraints.
















It is not hard to show that a result for the list $k$-means problem implies a corresponding result
for the constrained $k$-means problem with the number of calls to $A^{\mathbb{C}}$ being equal to the size of the
list $\mathcal{L}$. Therefore, we obtain as a corollary of our main result efficient algorithms for the constrained
$k$-means (and the constrained $k$-median) problems.

1.1 Related work 

The classical $k$-means problem is one of the most well-studied clustering problems. There is a
long sequence of work on obtaining fast PTAS for the $k$-means and the $k$-median problems (see
e.g., [Mat00, BHPI02, dlVKKR03, HPM04, KSS10, ABS10, Che09, JKS14, FMS07] and references
therein). Some of these works implicitly maintain a list of centers of size $k$ such that the condition (1)
is satisfied for all clusterings $O$ which correspond to a Voronoi partition (with respect to a set of
$k$ centers) of the input set of points, and one picks the best possible set of centers from this
list (see e.g., [KSS10, ABS10, JKS14]). The list has at most $2^{\tilde{O}(k/\varepsilon)}$ elements, and from this,
one can recover a $(1 + \varepsilon)$-approximation algorithm for the $k$-means problem with running time
$O(nd \cdot 2^{\tilde{O}(k/\varepsilon)})$.

The more general case of the constrained $k$-means problem was studied by Ding and Xu [DX15],
who also gave an algorithm that outputs a list of size $O((\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)})$. Our work improves
upon this result. Moreover, we consider the formulation of the list $k$-means problem as an important
contribution, and feel that similar formulations in other classification settings would be useful.

1.2 Preliminaries 

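We formally define the problems considered in this paper. The centroid or mean of a finite set of
points $X \subset \mathbb{R}^d$ is denoted by $\Gamma(X) = \frac{\sum_{x \in X} x}{|X|}$. Let $\Delta(X)$ denote the 1-means cost of this set of
points, i.e., $\sum_{x \in X} \|x - \Gamma(X)\|^2$.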

An input instance $I$ for the list $k$-means (or the list $k$-median) problem consists of a set of
points $X$, a positive integer $k$ and a positive parameter $\varepsilon$. A partition of $X$ into disjoint subsets
$O_1, \ldots, O_k$ will be called a clustering of $X$. Given a clustering $O^* = \{O_1, \ldots, O_k\}$ of $X$ and a set
of $k$ centers $C = \{c_1, \ldots, c_k\}$, define $\mathrm{cost}_C(O^*)$ as the minimum, over all permutations $\pi$ of $C$,
of $\sum_{i=1}^{k} \sum_{x \in O_i} \|x - c_{\pi(i)}\|^2$. Recall that $\mathrm{opt}_k(O^*)$ denotes the optimal $k$-means cost of $O^*$, i.e.,
$\sum_{i=1}^{k} \sum_{x \in O_i} \|x - \Gamma(O_i)\|^2$.
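To make the two quantities concrete, here is a small Python sketch that computes the cost of a clustering with respect to a set of centers (minimized over all matchings of centers to clusters) and the optimal cost of the clustering. The brute-force loop over permutations is only an illustration for small $k$; the function names are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster: List[Point]) -> Tuple[float, ...]:
    """Centroid of a non-empty cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def cost_of_clustering(clusters: List[List[Point]], centers: List[Point]) -> float:
    """Minimum over permutations pi of sum_i sum_{x in O_i} ||x - c_pi(i)||^2."""
    k = len(clusters)
    best = float("inf")
    for perm in permutations(range(k)):
        total = sum(sq_dist(x, centers[perm[i]]) for i in range(k) for x in clusters[i])
        best = min(best, total)
    return best


def opt_cost(clusters: List[List[Point]]) -> float:
    """Cost of the clustering when every cluster uses its own mean as center."""
    return sum(sq_dist(x, mean(cl)) for cl in clusters for x in cl)


# An arbitrary (non-Voronoi) partition of four points on a line.
clusters = [[(0.0, 0.0), (10.0, 0.0)], [(2.0, 0.0), (12.0, 0.0)]]
centers = [(7.0, 0.0), (5.0, 0.0)]
print(cost_of_clustering(clusters, centers), opt_cost(clusters))
```

The example partition is deliberately not a Voronoi partition: the clusters interleave, yet the centers $(5, 0)$ and $(7, 0)$ still attain the optimal cost once they are matched to the right clusters.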

For a set of points $X$ and a set of points $C$ (of size at most $k$), define $\Phi_C(X)$ as $\sum_{x \in X} \min_{c \in C} \|x - c\|^2$,
i.e., we consider the Voronoi partition of $X$ induced by $C$, and consider the $k$-means cost of $X$
with respect to this partition. When considering the list $k$-median problem, we will use the same
notation, except that we will consider the Euclidean norm instead of the square of the Euclidean
norm. When $C$ is a singleton set $\{c\}$, we shall abuse notation by using $\Phi_c(X)$ instead of $\Phi_{\{c\}}(X)$.

As mentioned in the introduction, the constrained $k$-means problem is specified by a set of
points $X$, a positive integer $k$, and a set $\mathbb{C}$ of feasible clusterings of $X$. Further, we are given an
algorithm $A^{\mathbb{C}}$, which given a set of $k$ centers $C$, outputs the clustering $O$ in $\mathbb{C}$ which minimizes
$\mathrm{cost}_C(O)$. The goal is to find a clustering $O \in \mathbb{C}$ and a set $C$ of size $k$ which minimizes $\mathrm{cost}_C(O)$.
Note that the centers in $C$ should just be the mean of each cluster in $O$. On the other hand, if we
know $C$, then we can find the best clustering in $\mathbb{C}$ by calling $A^{\mathbb{C}}$. We use the same notation for the
constrained $k$-median problem.

We now mention a few results which will be used in our analysis. The following fact is well 
known. 













Fact 1. For any $X \subset \mathbb{R}^d$ and $c \in \mathbb{R}^d$ we have $\sum_{x \in X} \|x - c\|^2 = \sum_{x \in X} \|x - \Gamma(X)\|^2 + |X| \cdot \|c - \Gamma(X)\|^2$.
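Fact 1 is easy to sanity-check numerically. The sketch below compares both sides of the identity on random points; the helper names and the test data are illustrative assumptions, and the check is a floating-point comparison rather than a proof.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points: List[Point]) -> Tuple[float, ...]:
    """Mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def phi(points: List[Point], c: Point) -> float:
    """1-means cost of the points with respect to the single center c."""
    return sum(sq_dist(x, c) for x in points)


def fact1_gap(points: List[Point], c: Point) -> float:
    """Absolute difference between the two sides of Fact 1."""
    g = centroid(points)
    lhs = phi(points, c)
    rhs = phi(points, g) + len(points) * sq_dist(c, g)
    return abs(lhs - rhs)


rng = random.Random(0)
X = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(50)]
print(fact1_gap(X, (3.0, -1.0)))  # expected to be at floating-point noise level
```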

We next define the notion of $D^2$-sampling.

Definition 1 ($D^2$-sampling). Given a set of points $X \subset \mathbb{R}^d$ and another set of points $C \subset \mathbb{R}^d$,
$D^2$-sampling from $X$ w.r.t. $C$ samples a point $x \in X$ with probability $\frac{\Phi_C(\{x\})}{\Phi_C(X)}$.
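A minimal Python sketch of $D^2$-sampling follows. Two conventions in it are assumptions of this sketch rather than statements from the text: when the center set is empty, or when every point coincides with a center, the sampler falls back to uniform sampling.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample(points: List[Point], centers: List[Point],
              num_samples: int, rng: random.Random) -> List[Point]:
    """Draw points independently, each with probability proportional to its
    squared distance to the nearest center."""
    if not centers:
        # Assumption: with no centers yet, sample uniformly.
        return [rng.choice(points) for _ in range(num_samples)]
    weights = [min(sq_dist(x, c) for c in centers) for x in points]
    if sum(weights) == 0:
        # Assumption: degenerate case, all points lie on centers.
        return [rng.choice(points) for _ in range(num_samples)]
    return rng.choices(points, weights=weights, k=num_samples)


points = [(0.0, 0.0), (1.0, 0.0), (100.0, 0.0)]
samples = d2_sample(points, [(0.0, 0.0)], 1000, random.Random(0))
print(samples.count((100.0, 0.0)))
```

In the example, the point at distance 100 from the center carries weight 10000 against weight 1 for the point at distance 1, so almost all samples land on it, and the point sitting on the center is never drawn.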

The following result of Inaba et al. [IKI94] shows that a constant size random sample is a good
enough approximation of a set of points $X$ as far as the 1-means objective is concerned.

Lemma 1 ([IKI94]). Let $S$ be a set of points obtained by independently sampling $M$ points with
replacement uniformly at random from a point set $X \subset \mathbb{R}^d$. Then for any $\delta > 0$,

$$\Pr\left[\Phi_{\Gamma(S)}(X) \le \left(1 + \frac{1}{\delta M}\right) \cdot \Delta(X)\right] \ge 1 - \delta.$$
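The sampling lemma can be illustrated empirically. The sketch below assumes the usual form of the bound, namely that the mean of $M$ uniform samples has 1-means cost at most $(1 + \frac{1}{\delta M}) \cdot \Delta(X)$ with probability at least $1 - \delta$, and estimates how often this happens on synthetic data. The data, parameters and function names are illustrative.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points: List[Point]) -> Tuple[float, ...]:
    """Mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def phi(points: List[Point], c: Point) -> float:
    """1-means cost of the points with respect to the center c."""
    return sum(sq_dist(x, c) for x in points)


def sampling_lemma_success_rate(points: List[Point], M: int, delta: float,
                                trials: int, rng: random.Random) -> float:
    """Fraction of trials in which the mean of M uniform samples (with
    replacement) is a (1 + 1/(delta*M))-approximate 1-means center."""
    threshold = (1 + 1 / (delta * M)) * phi(points, centroid(points))
    hits = 0
    for _ in range(trials):
        sample = [rng.choice(points) for _ in range(M)]
        if phi(points, centroid(sample)) <= threshold:
            hits += 1
    return hits / trials


rng = random.Random(1)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
rate = sampling_lemma_success_rate(X, M=10, delta=0.5, trials=200, rng=rng)
print(rate)
```

The lemma only promises a success rate of at least $1 - \delta$; in practice the observed rate tends to be noticeably higher, since the bound comes from a Markov-type argument.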
We will also use the following simple fact that may be interpreted as an approximate version of the
triangle inequality for squared Euclidean distance.

Fact 2 (Approximate triangle inequality). For any $x, y, z \in \mathbb{R}^d$, we have $\|x - z\|^2 \le 2 \cdot \|x - y\|^2 + 2 \cdot \|y - z\|^2$.
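For completeness, Fact 2 follows from the ordinary triangle inequality together with $2ab \le a^2 + b^2$. The short derivation below is a standard argument and is not taken from the source text.

```latex
\begin{align*}
\|x - z\|^2 &\le \bigl(\|x - y\| + \|y - z\|\bigr)^2 \\
            &= \|x - y\|^2 + 2\,\|x - y\|\,\|y - z\| + \|y - z\|^2 \\
            &\le 2\,\|x - y\|^2 + 2\,\|y - z\|^2 .
\end{align*}
```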


1.3 Our results 

We now state our results for the list $k$-means and the list $k$-median problems.

Theorem 1. Given a set of $n$ points $X \subset \mathbb{R}^d$ and parameters $k$ and $\varepsilon$, there is a randomized
algorithm which outputs a list $\mathcal{L}$ of $2^{\tilde{O}(k/\varepsilon)}$ sets of centers of size $k$ such that for any clustering
$O^* = \{O^*_1, \ldots, O^*_k\}$ of $X$, the following event happens with probability at least $1/2$: there is a set
$C \in \mathcal{L}$ such that

$$\mathrm{cost}_C(O^*) \le (1 + \varepsilon) \cdot \mathrm{opt}_k(O^*).$$

Moreover, the running time of our algorithm is $O(nd \cdot 2^{\tilde{O}(k/\varepsilon)})$. The same statement holds for the
list $k$-median problem as well, except that the size of the list $\mathcal{L}$ becomes $2^{\tilde{O}(k/\varepsilon^{O(1)})}$ and the running
time of our algorithm becomes $O(nd \cdot 2^{\tilde{O}(k/\varepsilon^{O(1)})})$.

As a corollary of this result we get a PTAS for the constrained $k$-means problem (and similarly for
the constrained $k$-median problem).

Corollary 1. There is a randomized algorithm which, given an instance of the constrained $k$-means
problem and a parameter $\varepsilon > 0$, outputs a solution of cost at most $(1 + \varepsilon)$ times the optimal cost with
probability at least $1/2$. Further, the time taken by this algorithm is $O(nd \cdot 2^{\tilde{O}(k/\varepsilon)}) + 2^{\tilde{O}(k/\varepsilon)} \cdot T$,
where $T$ denotes the time taken by $A^{\mathbb{C}}$ on this instance.

Proof. We use the algorithm in Theorem 1 to get a list $\mathcal{L}$ for this data-set. For each set $C \in \mathcal{L}$,
we invoke $A^{\mathbb{C}}$ with $C$ as the set of centers; let $O(C)$ denote the clustering produced by $A^{\mathbb{C}}$. We
output the clustering for which $\mathrm{cost}_C(O(C))$ is minimum. Let $O^*$ be the optimal clustering, i.e.,
the clustering in $\mathbb{C}$ for which $\mathrm{opt}_k(O^*)$ is minimum. We know that with probability at least $1/2$,
there is a set $C \in \mathcal{L}$ for which $\mathrm{cost}_C(O^*) \le (1 + \varepsilon) \cdot \mathrm{opt}_k(O^*)$. Now, the solution produced by our
algorithm has cost at most $\mathrm{cost}_C(O(C))$, which by definition of $A^{\mathbb{C}}$ is at most $\mathrm{cost}_C(O^*)$. □
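The reduction in this proof can be sketched in a few lines of Python. The partition algorithm is passed in as a callable; in the sketch it is instantiated with the plain Voronoi partition, which corresponds to the unconstrained $k$-means special case. The names `best_from_list` and `voronoi_oracle` are hypothetical, and a real constrained instance would supply its own oracle (for example, one based on min-cost flow for $r$-gather).

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Oracle = Callable[[List[Point], List[Point]], List[List[Point]]]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def voronoi_oracle(points: List[Point], centers: List[Point]) -> List[List[Point]]:
    """Partition oracle for the unconstrained case: cluster i holds the points
    whose nearest center is centers[i]."""
    clusters: List[List[Point]] = [[] for _ in centers]
    for x in points:
        j = min(range(len(centers)), key=lambda t: sq_dist(x, centers[t]))
        clusters[j].append(x)
    return clusters


def best_from_list(points: List[Point], candidates: List[List[Point]],
                   oracle: Oracle) -> Tuple[float, List[Point], List[List[Point]]]:
    """Call the oracle once per candidate center set and keep the cheapest
    (cost, centers, clustering) triple."""
    best = None
    for centers in candidates:
        clusters = oracle(points, centers)
        cost = sum(sq_dist(x, centers[i])
                   for i, cluster in enumerate(clusters) for x in cluster)
        if best is None or cost < best[0]:
            best = (cost, centers, clusters)
    return best


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
candidates = [[(0.0, 0.0), (0.0, 2.0)], [(0.0, 1.0), (10.0, 0.0)]]
cost, centers, clusters = best_from_list(points, candidates, voronoi_oracle)
print(cost, centers)
```

The number of oracle calls equals the length of the candidate list, which is where the second term of the running time in Corollary 1 comes from.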









We also give a nearly matching lower bound on the size of $\mathcal{L}$. The following result along with Yao's
Lemma shows that one cannot reduce the size of $\mathcal{L}$ to less than $2^{\tilde{\Omega}(k/\sqrt{\varepsilon})}$.

Theorem 2. Given a parameter $k$ and a small enough positive constant $\varepsilon$, there exists a set $X$
of points in $\mathbb{R}^d$ and a set $\mathbb{C}$ of clusterings of $X$ such that any list $\mathcal{L}$ of centers of size $k$ with the
following property must have size at least $2^{\tilde{\Omega}(k/\sqrt{\varepsilon})}$: for at least half of the clusterings $O \in \mathbb{C}$, there
exists a set $C$ in $\mathcal{L}$ such that $\mathrm{cost}_C(O) \le (1 + \varepsilon) \cdot \mathrm{opt}_k(O)$.

Our techniques also extend to settings involving many other "approximate" metric spaces (see
the discussion in Section 6). Another important observation is that in the lower bound result
above, the clusterings in $\mathbb{C}$ correspond to Voronoi partitions of $X$. This throws light on the previous
works [KSS10, ABS10, FMS07, JKS14, JKY15] as to why the running time of all the algorithms
was proportional to $2^{\tilde{O}(k/\varepsilon)}$: they were implicitly maintaining a list which satisfied (1) for all
Voronoi partitions of $X$, and therefore, our lower bound result applies to their algorithms as well.

1.4 Our Techniques 

Our techniques are based on the idea of $D^2$-sampling that was used by Jaiswal et al. [JKS14] to
give a $(1 + \varepsilon)$-approximation algorithm for the $k$-means problem. Our ideas also have similarities to
the ideas of Ding and Xu [DX15]. We discuss these similarities towards the end of this subsection.

One of the crucial ingredients that is used in most of the $(1 + \varepsilon)$-approximation algorithms for
$k$-means is Lemma 1. This result essentially states that given a set of points $P$, if we are able to
uniformly sample $O(1/\varepsilon)$ points from it, then the mean of these sampled points will be a good
substitute for the mean of $P$. Consider an optimal clustering $O^*_1, \ldots, O^*_k$ for a set of points $X$. If
we could uniformly sample from each of the clusters $O^*_i$, then by the argument above, we would be
done. The first problem one encounters is that one can only sample from the input set of points,
and so, if we sample sufficiently many points from $X$, we need to somehow distinguish the points
which belong to $O^*_i$ in this sample. This can be dealt with using the following argument: suppose
we manage to get a small sample $S$ of points (say of size $O(\mathrm{poly}(k/\varepsilon))$) that contains at least $\Omega(1/\varepsilon)$
points uniformly distributed in $O^*_i$; then we can try all possible subsets of $S$ of size $O(1/\varepsilon)$ and
ensure that at least one of the subsets is a uniform sample of appropriate size from $O^*_i$. Another
issue is: how do we ensure that the sample $S$ has sufficient representation from $O^*_i$? Uniform
sampling from the input $X$ will not work since $|O^*_i|$ might be really small compared to the size of
$|X|$. This is where $D^2$-sampling plays a crucial role and we discuss this next.

Given a set of points $X \subset \mathbb{R}^d$ and candidate centers $c_1, \ldots, c_i \in \mathbb{R}^d$, $D^2$-sampling with respect to
the centers $c_1, \ldots, c_i$ samples a point $x \in X$ with probability proportional to $\min_{c \in \{c_1, \ldots, c_i\}} \|x - c\|^2$.
Note that this process "boosts" the probability of a cluster $O^*_j$ that has many points far from
the set $\{c_1, \ldots, c_i\}$. Therefore, even if a cluster $O^*_j$ has a small size, we will have a good chance
of sampling points from it (if it is far from the current set of centers). However, this nonuniform
sampling technique gives rise to another issue. The points being sampled are no longer uniform
samples from the optimal clusters. Depending on the current set of centers, different points in a
cluster $O^*_j$ have different probability of getting sampled. This issue is not that grave for the $k$-means
problem where the optimal clusters are Voronoi regions, since we can argue that the probabilities
are not very different. However, for the constrained $k$-means problem, where the optimal clusters are
allowed to be an arbitrary partition of the input points, this problem becomes more serious. This can
be illustrated using the following example. Suppose we have managed to pick centers $c_1, \ldots, c_i$ that
are good (in terms of cluster cost) for the optimal clusters $O^*_1, \ldots, O^*_i$. At this point let $O^*_j$ denote
the cluster other than $O^*_1, \ldots, O^*_i$ such that a point sampled using $D^2$-sampling w.r.t. $c_1, \ldots, c_i$ is
most likely to be from $O^*_j$. Suppose we sample a set $S$ of $O(k/\varepsilon)$ points using $D^2$-sampling. Are
we guaranteed (w.h.p.) to have a subset in $S$ that is a uniform sample from $O^*_j$? The answer is no
(actually quite far from it). This is because the optimal clusters may form an arbitrary partition
of the data-set and it is possible that most of the points in $O^*_j$ might be very close to the centers
$c_1, \ldots, c_i$. In this case the probability of sampling such points will be close to 0. The way we deal
with this scenario is that we consider a multi-set $S'$ that is the union of the set of samples $S$ and
$O(1/\varepsilon)$ copies of each of $c_1, \ldots, c_i$. We then argue that all the points in $O^*_j$ that are far from $c_1, \ldots, c_i$
will have a good chance of being represented in $S$ (and hence in $S'$). On the other hand, even
though the points that are close to one of $c_1, \ldots, c_i$ will not be represented in $S$,
the centers (among $c_1, \ldots, c_i$) that are close to these points have good representation in $S'$, and these
centers may be regarded as "proxy" for the points in $O^*_j$.

Ding and Xu [DX15], instead of using the idea of $D^2$-sampling, rely on the ideas of Kumar et
al. [KSS10], which involve uniform sampling of points and then pruning the data-set by removing
the points that are close to centers that are currently being considered. In their work, they also
encounter the problem that points from some optimal cluster might be close to the current set of
good centers (and hence will be removed before uniform sampling). Ding and Xu [DX15] deal with
this issue using what they call a "simplex lemma". Consider the same scenario as in the previous
paragraph. At a very high level, they consider grids inside several simplices defined by the current
centers $c_1, \ldots, c_i$ and the sampled points. Using the simplex lemma, they argue that one of the
points inside these grids will be a good center for the cluster $O^*_j$.

We now give an overview of the paper. In Section 2, we give the algorithm for generating the
list of sets of centers for an instance of the list $k$-means problem. The algorithm is analyzed in
Section 3. In Section 4, we give the lower bound result on the size of the list $\mathcal{L}$. In Section 5, we
discuss how our algorithm can be extended to the list $k$-median problem. We conclude with a brief
discussion on extensions to other metrics in Section 6.

2 The Algorithm 

Consider an instance of the list $k$-means problem. Let $X$ denote the set of points, and $\varepsilon$ be a
positive parameter. The algorithm List-k-means is described in Figure 2.1. It maintains a set $C$ of
centers, which is initially empty. Each recursive call to the function Sample-centers increases the
size of $C$ by one. In Step 2 of this function, the algorithm tries out various candidates which can
be added to $C$ (to increase its size by 1). First, it builds a multi-set $S'$ as follows: it independently
samples (with replacement) $O(k/\varepsilon^3)$ points using $D^2$-sampling from $X$ w.r.t. the set $C$. Further, it
adds $O(1/\varepsilon)$ copies of each of the centers in $C$ to this multi-set. Having constructed $S'$, we consider all
subsets of $S'$ of size $O(1/\varepsilon)$; for each such subset we try adding the mean of this subset to $C$. Thus,
each invocation of Sample-centers makes multiple recursive calls to itself ($\binom{|S'|}{M}$ to be precise, where
$M$ is the subset size). It will be useful to think of the execution of this algorithm as a tree $T$ of depth $k$.
Each node in the tree can be labeled with a set $C$: it corresponds to the invocation of Sample-centers with
this set as $C$ (and $i$ being the depth of this node). The children of a node denote the recursive
function calls by the corresponding invocation of Sample-centers. Finally, the leaves denote the
sets of centers produced by the algorithm.
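The recursive procedure described above can be sketched in Python as follows. This is a toy rendering under stated assumptions: the sample size `N`, the subset size `M` and the number of repetitions are left as free parameters (the analysis needs specific values for them that are far larger than anything practical to enumerate), and $D^2$-sampling with respect to an empty center set is taken to be uniform sampling.

```python
import random
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points: Sequence[Point]) -> Tuple[float, ...]:
    """Mean of a non-empty collection of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def d2_sample(points: List[Point], centers: List[Point],
              num_samples: int, rng: random.Random) -> List[Point]:
    """D^2-sampling; uniform when there are no centers or all weights are zero
    (both fallbacks are assumptions of this sketch)."""
    weights = [min(sq_dist(x, c) for c in centers) for x in points] if centers else []
    if not centers or sum(weights) == 0:
        return [rng.choice(points) for _ in range(num_samples)]
    return rng.choices(points, weights=weights, k=num_samples)


def sample_centers(points, k, i, centers, N, M, rng, out):
    """One node of the execution tree: extend `centers` by one candidate mean."""
    if i == k:
        out.append(list(centers))
        return
    sampled = d2_sample(points, centers, N, rng)                    # Step 2(a)
    pool = sampled + [c for c in centers for _ in range(M)]         # Steps 2(b), 2(c)
    for subset in combinations(pool, M):                            # Step 2(d)
        sample_centers(points, k, i + 1, centers + [centroid(subset)], N, M, rng, out)


def list_k_means(points, k, N, M, repeats, rng):
    """Return the list of all candidate center sets found at the leaves."""
    out: List[List[Point]] = []
    for _ in range(repeats):
        sample_centers(points, k, 0, [], N, M, rng, out)
    return out


pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
candidates = list_k_means(pts, k=2, N=4, M=2, repeats=3, rng=random.Random(0))
print(len(candidates))
```

Each recursive call receives a fresh list `centers + [...]` rather than mutating the shared set, so sibling branches of the execution tree do not interfere with each other. With `N = 4` and `M = 2`, the root has $\binom{4}{2} = 6$ children and each of those has $\binom{6}{2} = 15$ children, so every repetition contributes 90 leaves.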

3 Analysis

In this section we prove Theorem 1 for the list $k$-means problem. Let $\mathcal{L}$ denote the set of candidate
solutions produced by List-k-means, where a solution corresponds to a set of centers $C$ of size $k$.
These solutions are output at the leaves of the execution tree $T$. Fix a clustering $O^* = \{O^*_1, \ldots, O^*_k\}$
of $X$. Recall that a node $v$ at depth $i$ in the execution tree $T$ corresponds to a set $C$ of size $i$; call








List-k-means($X$, $k$, $\varepsilon$)

  - Let $N = O(k/\varepsilon^3)$ and $M = O(1/\varepsilon)$.
  - Initialize $\mathcal{L}$ to $\emptyset$.
  - Repeat $2^k$ times:
      - Make a call to Sample-centers($X$, $k$, $\varepsilon$, $0$, $\{\}$).
  - Return $\mathcal{L}$.

Sample-centers($X$, $k$, $\varepsilon$, $i$, $C$)

  (1) If ($i = k$) then add $C$ to the set $\mathcal{L}$.
  (2) else
      (a) Sample a multi-set $S$ of $N$ points with $D^2$-sampling (w.r.t. centers $C$)
      (b) $S' \leftarrow S$
      (c) For all $c \in C$: $S' \leftarrow S' \cup \{M \text{ copies of } c\}$
      (d) For all subsets $T \subset S'$ of size $M$:
          (i) $C \leftarrow C \cup \{\Gamma(T)\}$.
          (ii) Sample-centers($X$, $k$, $\varepsilon$, $i + 1$, $C$)

Algorithm 2.1. Algorithm for list $k$-means


this set $C_v$. Our proof will argue inductively that for each $i$, there will be a node $v$ at depth $i$ such
that the centers chosen so far in $C_v$ are good with respect to a subset of $i$ clusters in $O^*_1, \ldots, O^*_k$.
We will argue that the following invariant $P(i)$ is maintained during the recursive calls to
Sample-centers:

$P(i)$: With probability at least $\frac{1}{2^{i-1}}$, there is a node $v_i$ at depth $(i - 1)$ in the tree $T$ and a
set of $(i - 1)$ distinct clusters $O^*_{j_1}, O^*_{j_2}, \ldots, O^*_{j_{i-1}}$ such that

$$\forall l \in \{1, \ldots, i - 1\}: \quad \Phi_{c_l}(O^*_{j_l}) \le \left(1 + \frac{\varepsilon}{2}\right) \cdot \Delta(O^*_{j_l}) + \frac{\varepsilon}{2k} \cdot \mathrm{opt}_k(O^*), \qquad (2)$$

where $c_1, \ldots, c_{i-1}$ are the centers in the set $C_{v_i}$ corresponding to $v_i$. Recall that $\Delta(O^*_{j_l})$ refers
to the optimal 1-means cost of $O^*_{j_l}$.

The proof of the main theorem follows easily from this invariant property: indeed, the statement
$P(k)$ holds with probability at least $1/2^k$. Since the algorithm List-k-means invokes
Sample-centers $2^k$ times, the probability of the statement in $P(k)$ being true in at least one of these
invocations is at least a constant. We now prove the invariant by induction on $i$. The base case
for $i = 1$ follows trivially: the vertex $v_1$ is the root of the tree $T$ and $C_{v_1}$ is empty. Now assume
that $P(i)$ holds for some $i \ge 1$. We will prove that $P(i + 1)$ also holds. We first condition on
the event in $P(i)$ (which happens with probability at least $\frac{1}{2^{i-1}}$). Let $v_i$ and $O^*_{j_1}, \ldots, O^*_{j_{i-1}}$ be as
guaranteed by the invariant $P(i)$. Let $C_{v_i} = \{c_1, \ldots, c_{i-1}\}$ (as in the statement $P(i)$). For sake of
ease of notation, we assume without loss of generality that the index $j_l$ is $l$, and we shall use $C_i$
to denote $C_{v_i}$. Thus, the center $c_l$ corresponds to the cluster $O^*_l$, $1 \le l \le i - 1$. Note that for a
cluster $O^*_{i'}$, $i' \ge i$, $\Phi_{C_i}(O^*_{i'})$ is proportional to the probability that a point sampled from $X$ using
$D^2$-sampling w.r.t. $C_i$ comes from the set $O^*_{i'}$; let $\bar{i} \in \{i, \ldots, k\}$ be the index $i'$ for which $\Phi_{C_i}(O^*_{i'})$
is maximum. We will argue that the invocation of Sample-centers corresponding to $v_i$ will try
out a point $c_i$ (in Step 2(d)(i)) such that the following property will hold with probability at least
$1/2$: $\Phi_{c_i}(O^*_{\bar{i}}) \le (1 + \varepsilon/2) \cdot \Delta(O^*_{\bar{i}}) + (\varepsilon/2k) \cdot \mathrm{opt}_k(O^*)$. For doing this, we break the analysis into the
following two parts. These two parts are discussed in the next two subsections that follow.


Case I $\left(\frac{\Phi_{C_i}(O^*_{\bar{i}})}{\sum_{j=1}^{k} \Phi_{C_i}(O^*_j)} \le \frac{\varepsilon}{13k}\right)$: This captures the scenario where the probability of sampling from
any of the uncovered clusters is very small. Note that for the classical $k$-means problem, this is not
an issue because in this case we can argue that the current set of centers $C_i$ already provides a good
approximation for the entire set of data points and we are done. However, for us this is an issue:
for example, assuming $i > 2$, it is possible that some of the points in $O^*_{\bar{i}}$ are close to $c_1$, whereas
the remaining points of this cluster are close to $c_2$. Still we need to output a center for $O^*_{\bar{i}}$. In this
case we argue that it will be sufficient to output a suitable convex combination of $c_1$ and $c_2$.

Case II $\left(\frac{\Phi_{C_i}(O^*_{\bar{i}})}{\sum_{j=1}^{k} \Phi_{C_i}(O^*_j)} > \frac{\varepsilon}{13k}\right)$: In this case, we argue that with good probability we will sample
sufficient points from $O^*_{\bar{i}}$ during Step 2(a) of Sample-centers. Further, we will show that a suitable
combination of such points along with centers in $C_i$ will be a good center for $O^*_{\bar{i}}$.


Case I $\left(\frac{\Phi_{C_i}(O^*_{\bar{i}})}{\sum_{j=1}^{k} \Phi_{C_i}(O^*_j)} \le \frac{\varepsilon}{13k}\right)$

In this case we argue that a convex combination of the centers in $C_i$ provides a good
approximation to $\Delta(O^*_{\bar{i}})$. Intuitively, this is because the points in $O^*_{\bar{i}}$ are close to the points in the set $C_i$.
This convex combination is essentially "simulated" by taking $O(1/\varepsilon)$ copies of each of the centers
$c_1, \ldots, c_{i-1}$ in the multi-set $S'$ and then trying all possible subsets of size $O(1/\varepsilon)$. The formal analysis
follows. First, we note that $\Phi_{C_i}(O^*_{\bar{i}})$ should be small compared to $\mathrm{opt}_k(O^*)$.

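Lemma 2. $\Phi_{C_i}(O^*_{\bar{i}}) \le \frac{\varepsilon}{6k} \cdot \mathrm{opt}_k(O^*)$.

Proof. Let $D$ denote $\sum_{j=1}^{k} \Phi_{C_i}(O^*_j)$. The induction hypothesis and the fact that $\Phi_{C_i}(O^*_{\bar{i}}) \ge \Phi_{C_i}(O^*_j)$
for all $j \ge i$ imply that

$$D = \sum_{j=1}^{i-1} \Phi_{C_i}(O^*_j) + \sum_{j=i}^{k} \Phi_{C_i}(O^*_j) \le \left(1 + \frac{\varepsilon}{2}\right) \cdot \sum_{j=1}^{i-1} \Delta(O^*_j) + \frac{\varepsilon}{2} \cdot \mathrm{opt}_k(O^*) + k \cdot \Phi_{C_i}(O^*_{\bar{i}}).$$

Since $\Phi_{C_i}(O^*_{\bar{i}}) \le \frac{\varepsilon}{13k} \cdot D$ and $\sum_{j=1}^{i-1} \Delta(O^*_j) \le \mathrm{opt}_k(O^*)$, we get $D \le \frac{\varepsilon}{13} \cdot D + (1 + \varepsilon) \cdot \mathrm{opt}_k(O^*)$. Thus,
$D \le \frac{1 + \varepsilon}{1 - \varepsilon/13} \cdot \mathrm{opt}_k(O^*)$. Finally, $\Phi_{C_i}(O^*_{\bar{i}}) \le \frac{\varepsilon}{13k} \cdot D \le \frac{\varepsilon}{6k} \cdot \mathrm{opt}_k(O^*)$, where the last inequality uses $\varepsilon \le 1$. □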

For each point $p \in O^*_{\bar{i}}$, let $c(p)$ denote the closest center in $C_i$. We now define a multi-set $O'_{\bar{i}}$ as
$\{c(p) : p \in O^*_{\bar{i}}\}$. Note that $O'_{\bar{i}}$ is obtained by taking multiple copies of points in $C_i$. The remaining
part of the proof proceeds in two steps. Let $m^*$ and $m'$ denote the mean of $O^*_{\bar{i}}$ and $O'_{\bar{i}}$ respectively.
We first show that $m^*$ and $m'$ are close, and so, assigning all the points of $O^*_{\bar{i}}$ to $m'$ will have cost
close to $\Delta(O^*_{\bar{i}})$. Secondly, we show that if we have a good approximation $m''$ to $m'$, then assigning
all the points of $O^*_{\bar{i}}$ to $m''$ will also incur small cost (comparable to $\Delta(O^*_{\bar{i}})$). We now carry out
these steps in detail. Observe that

$$\sum_{p \in O^*_{\bar{i}}} \|p - c(p)\|^2 = \Phi_{C_i}(O^*_{\bar{i}}). \qquad (3)$$


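Lemma 3. $\|m^* - m'\|^2 \le \frac{\Phi_{C_i}(O^*_{\bar{i}})}{|O^*_{\bar{i}}|}$.

Proof. Let $n^*$ denote $|O^*_{\bar{i}}|$. Then,

$$\|m^* - m'\|^2 = \frac{1}{(n^*)^2} \Big\| \sum_{p \in O^*_{\bar{i}}} (p - c(p)) \Big\|^2 \le \frac{1}{(n^*)^2} \Big( \sum_{p \in O^*_{\bar{i}}} \|p - c(p)\| \Big)^2 \le \frac{1}{n^*} \sum_{p \in O^*_{\bar{i}}} \|p - c(p)\|^2 = \frac{\Phi_{C_i}(O^*_{\bar{i}})}{n^*},$$

where the first inequality follows from the triangle inequality, the second from the Cauchy-Schwarz inequality,² and the final equality from (3). □

² For any real numbers $a_1, \ldots, a_m$, $\left(\sum_r a_r\right)^2 / m \le \sum_r a_r^2$.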













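Now we show that $\Delta(O^*_{\bar{i}})$ and $\Delta(O'_{\bar{i}})$ are close.

Lemma 4. $\Delta(O'_{\bar{i}}) \le 2 \cdot \Phi_{C_i}(O^*_{\bar{i}}) + 2 \cdot \Delta(O^*_{\bar{i}})$.

Proof. The lemma follows by the following inequalities:

$$\Delta(O'_{\bar{i}}) = \sum_{p \in O^*_{\bar{i}}} \|c(p) - m'\|^2 \overset{\text{Fact 1}}{\le} \sum_{p \in O^*_{\bar{i}}} \|c(p) - m^*\|^2 \overset{\text{Fact 2}}{\le} 2 \cdot \sum_{p \in O^*_{\bar{i}}} \left( \|c(p) - p\|^2 + \|p - m^*\|^2 \right) = 2 \cdot \Phi_{C_i}(O^*_{\bar{i}}) + 2 \cdot \Delta(O^*_{\bar{i}}).$$
□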
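Finally, we argue that a good center for $O'_{\bar{i}}$ will also serve as a good center for $O^*_{\bar{i}}$.

Lemma 5. Let $m''$ be a point such that $\Phi_{m''}(O'_{\bar{i}}) \le \left(1 + \frac{\varepsilon}{8}\right) \cdot \Delta(O'_{\bar{i}})$. Then
$\Phi_{m''}(O^*_{\bar{i}}) \le \left(1 + \frac{\varepsilon}{2}\right) \cdot \Delta(O^*_{\bar{i}}) + \frac{\varepsilon}{2k} \cdot \mathrm{opt}_k(O^*)$.

Proof. Let $n^*$ denote $|O^*_{\bar{i}}|$. Observe that

$$\begin{aligned}
\Phi_{m''}(O^*_{\bar{i}}) = \sum_{p \in O^*_{\bar{i}}} \|m'' - p\|^2
&\overset{\text{Fact 1}}{=} \sum_{p \in O^*_{\bar{i}}} \|m^* - p\|^2 + n^* \cdot \|m^* - m''\|^2 \\
&\overset{\text{Fact 2}}{\le} \Delta(O^*_{\bar{i}}) + 2n^* \left( \|m^* - m'\|^2 + \|m' - m''\|^2 \right) \\
&\overset{\text{Lemma 3}}{\le} \Delta(O^*_{\bar{i}}) + 2 \cdot \Phi_{C_i}(O^*_{\bar{i}}) + 2n^* \|m' - m''\|^2 \\
&\overset{\text{Fact 1}}{\le} \Delta(O^*_{\bar{i}}) + 2 \cdot \Phi_{C_i}(O^*_{\bar{i}}) + 2 \cdot \left( \Phi_{m''}(O'_{\bar{i}}) - \Delta(O'_{\bar{i}}) \right) \\
&\le \Delta(O^*_{\bar{i}}) + 2 \cdot \Phi_{C_i}(O^*_{\bar{i}}) + \frac{\varepsilon}{4} \cdot \Delta(O'_{\bar{i}}) \\
&\overset{\text{Lemma 4}}{\le} \Delta(O^*_{\bar{i}}) + 2 \cdot \Phi_{C_i}(O^*_{\bar{i}}) + \frac{\varepsilon}{2} \cdot \left( \Phi_{C_i}(O^*_{\bar{i}}) + \Delta(O^*_{\bar{i}}) \right) \\
&\overset{\text{Lemma 2}}{\le} \left(1 + \frac{\varepsilon}{2}\right) \cdot \Delta(O^*_{\bar{i}}) + \frac{\varepsilon}{2k} \cdot \mathrm{opt}_k(O^*).
\end{aligned}$$

This completes the proof of the lemma. □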


The above lemma tells us that it will be sufficient to obtain a $(1 + \varepsilon/8)$-approximation to the
1-means problem for the dataset $O'_{\bar{i}}$. Now, Lemma 1 tells us that there is a subset (again as a
multi-set) $O''$ of size $M$ of $O'_{\bar{i}}$ such that the mean $m''$ of these points satisfies the conditions of
Lemma 5. Now, observe that $O''$ will be a subset of the set $S'$ constructed in Step 2 of the algorithm
Sample-centers: indeed, in Step 2(c), we add $M$ copies of each point in $C_i$ to $S'$. Now,
in Step 2(d), we will try out all subsets of size $M$ of $S'$ and for each such subset, we will try adding
its mean to $C_i$. In particular, there will be a recursive call of this function where we will have
$C_{i+1} = C_i \cup \{m''\}$ as the set of centers. Lemma 5 now implies that $C_{i+1}$ will satisfy the invariant
$P(i + 1)$. Thus, we are done in this case.
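The way subsets of copied centers "simulate" convex combinations can be seen on a tiny example. The sketch below enumerates all size-`M` subsets of a pool holding `M` copies of each center and collects the distinct means; every mean is a convex combination of the centers with weights that are multiples of `1/M`. The function name and the toy centers are illustrative.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Sequence[float]


def subset_means(centers: List[Point], M: int) -> List[Tuple[float, ...]]:
    """Distinct means of all size-M subsets of a multi-set that contains
    M copies of every center."""
    pool = [c for c in centers for _ in range(M)]
    means = set()
    for subset in combinations(pool, M):
        means.add(tuple(sum(coords) / M for coords in zip(*subset)))
    return sorted(means)


print(subset_means([(0.0, 0.0), (1.0, 0.0)], 4))
```

With two centers and `M = 4`, the means form the grid $0, \frac{1}{4}, \frac{1}{2}, \frac{3}{4}, 1$ along the segment between the centers, so a larger `M` gives a finer grid of candidate convex combinations.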


Case II $\left(\frac{\Phi_{C_i}(O^*_{\bar{i}})}{\sum_{j=1}^{k} \Phi_{C_i}(O^*_j)} > \frac{\varepsilon}{13k}\right)$

In this case, we would like to prove that we add a good approximation to the mean of $O^*_{\bar{i}}$ to the set
$C_i$. Again, consider the invocation of Sample-centers corresponding to $C_i$. We want the multi-set
$S$ to contain a good representation from points in the set $O^*_{\bar{i}}$. Secondly, in order to apply Lemma 1,
we will need this representation to be a uniform sample from $O^*_{\bar{i}}$. Since $\Phi_{C_i}(O^*_{\bar{i}}) > \frac{\varepsilon}{13k} \cdot \sum_{j=1}^{k} \Phi_{C_i}(O^*_j)$,
the probability that a point sampled using $D^2$-sampling w.r.t. $C_i$ is from $O^*_{\bar{i}}$ is not too small. So,
the multi-set $S$ will have non-negligible representation from the set $O^*_{\bar{i}}$. However, the points from $O^*_{\bar{i}}$
in $S$ may not be a uniform sample from $O^*_{\bar{i}}$. Indeed, suppose there is a good fraction of points of $O^*_{\bar{i}}$
which are close to $C_i$, and the remaining points of $O^*_{\bar{i}}$ are quite far from $C_i$. Then, $D^2$-sampling w.r.t.
$C_i$ will not give us a uniform sample from $O^*_{\bar{i}}$. To alleviate this problem, we take sufficiently many
copies of points in $C_i$ and add them to the multi-set $S$. In some sense, these copies act as proxy for
points in $O^*_{\bar{i}}$ that are too close to $C_i$. Finally, we argue that one of the subsets of $S'$ "simulates" a
uniform sample from $O^*_{\bar{i}}$ and the mean of this subset provides a good approximation for the mean
of $O^*_{\bar{i}}$. The formal analysis follows.

We divide the points in $O^*_{\bar{i}}$ into two parts: points which are close to a center in $C_i$, and the
remaining points. More formally, let the radius $R$ be given by

$$R^2 = \frac{\varepsilon}{41} \cdot \frac{\Phi_{C_i}(O^*_{\bar{i}})}{|O^*_{\bar{i}}|}. \qquad (4)$$

Define $O^n_{\bar{i}}$ as the points in $O^*_{\bar{i}}$ which are within distance $R$ of a center in $C_i$, and $O^f_{\bar{i}}$ as the rest
of the points in $O^*_{\bar{i}}$. As in Case I, we define a new set $O'_{\bar{i}}$ where each point in $O^n_{\bar{i}}$ is replaced by a
copy of the corresponding point in $C_i$. For a point $p \in O^n_{\bar{i}}$, define $c(p)$ as the closest center in $C_i$ to
$p$. Now define a multi-set $O'_{\bar{i}}$ as $O^f_{\bar{i}} \cup \{c(p) : p \in O^n_{\bar{i}}\}$. Intuitively, $O'_{\bar{i}}$ denotes the set of points that
are the same as $O^*_{\bar{i}}$ except that points close to centers in $C_i$ have been "collapsed" to these centers by
taking an appropriate number of copies. Clearly, $|O'_{\bar{i}}| = |O^*_{\bar{i}}|$. At a high level, we will argue that any
center that provides a good 1-means approximation for $O'_{\bar{i}}$ also provides a good approximation for
$O^*_{\bar{i}}$. We will then focus on analyzing whether the invocation of Sample-centers tries out a good
center for $O'_{\bar{i}}$.
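The "collapsing" construction can be written out in a few lines. The sketch below takes the radius as an explicit argument rather than computing it, and the function name `collapse` is a hypothetical choice; it returns the collapsed multi-set together with the number of near points that were replaced by their closest center.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def collapse(cluster: List[Point], centers: List[Point],
             radius: float) -> Tuple[List[Point], int]:
    """Replace every point within `radius` of its closest center by that
    center; keep far points unchanged. Returns (collapsed multi-set, number
    of replaced points)."""
    collapsed: List[Point] = []
    near = 0
    for p in cluster:
        closest = min(centers, key=lambda c: sq_dist(p, c))
        if sq_dist(p, closest) <= radius ** 2:
            collapsed.append(closest)   # near point: use the center as proxy
            near += 1
        else:
            collapsed.append(p)         # far point: kept as is
    return collapsed, near


cluster = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0)]
collapsed, near = collapse(cluster, [(0.0, 0.0)], radius=1.0)
print(collapsed, near)
```

The collapsed multi-set always has the same size as the original cluster, which is the property the analysis uses when comparing the two means.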

We give some more notation. Let $m^*$ and $m'$ denote the mean of $O^*_{\bar{i}}$ and $O'_{\bar{i}}$ respectively. Let $n^*$
and $n$ denote the size of the sets $O^*_{\bar{i}}$ and $O^n_{\bar{i}}$ respectively. First, we show that $\Delta(O^*_{\bar{i}})$ is large with
respect to $R$.


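Lemma 6. $\Delta(O^*_{\bar{i}}) = \Phi_{m^*}(O^*_{\bar{i}}) \ge \frac{16n}{\varepsilon} R^2$.

Proof. Let $c$ be the center in $C_i$ which is closest to $m^*$. We divide the proof into two cases:

(i) $\|m^* - c\| \ge \frac{5}{\sqrt{\varepsilon}} \cdot R$: For any point $p \in O^n_{\bar{i}}$, the triangle inequality implies that

$$\|p - m^*\| \ge \|c(p) - m^*\| - \|c(p) - p\| \ge \frac{5}{\sqrt{\varepsilon}} \cdot R - R \ge \frac{4}{\sqrt{\varepsilon}} \cdot R.$$

Therefore,

$$\Delta(O^*_{\bar{i}}) \ge \sum_{p \in O^n_{\bar{i}}} \|p - m^*\|^2 \ge \frac{16n}{\varepsilon} R^2.$$

(ii) $\|m^* - c\| < \frac{5}{\sqrt{\varepsilon}} \cdot R$: In this case, we have

$$\Delta(O^*_{\bar{i}}) \overset{\text{Fact 1}}{=} \Phi_c(O^*_{\bar{i}}) - n^* \cdot \|m^* - c\|^2 \ge \Phi_{C_i}(O^*_{\bar{i}}) - n^* \cdot \|m^* - c\|^2 \overset{(4)}{\ge} \frac{41 n^*}{\varepsilon} R^2 - \frac{25 n^*}{\varepsilon} R^2 \ge \frac{16n}{\varepsilon} R^2.$$

This completes the proof of the lemma. □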


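Lemma 7. $\|m^* - m'\|^2 \le \frac{n}{n^*} \cdot R^2$.

Proof. Since the only difference between $O_{\bar{i}}$ and $O'_{\bar{i}}$ is the points in $O^n_{\bar{i}}$, we get

$$\|m^* - m'\|^2 = \frac{1}{(n^*)^2} \Big\| \sum_{p \in O^n_{\bar{i}}} (p - c(p)) \Big\|^2 \le \frac{n}{(n^*)^2} \sum_{p \in O^n_{\bar{i}}} \|p - c(p)\|^2 \le \frac{n^2}{(n^*)^2} \cdot R^2 \le \frac{n}{n^*} \cdot R^2,$$

where the first inequality follows from the Cauchy-Schwarz inequality. □

We now show that $\Delta(O'_{\bar{i}})$ is close to $\Delta(O_{\bar{i}})$.

Lemma 8. $\Delta(O'_{\bar{i}}) \le 4nR^2 + 2 \cdot \Delta(O_{\bar{i}})$.

Proof. The lemma follows from the following sequence of inequalities:

$$\begin{aligned}
\Delta(O'_{\bar{i}}) &= \sum_{p \in O^n_{\bar{i}}} \|c(p) - m'\|^2 + \sum_{p \in O^f_{\bar{i}}} \|p - m'\|^2 \\
&\overset{\text{Fact 2}}{\le} \sum_{p \in O^n_{\bar{i}}} 2\left(\|c(p) - p\|^2 + \|p - m'\|^2\right) + \sum_{p \in O^f_{\bar{i}}} \|p - m'\|^2 \\
&\le 2nR^2 + 2 \sum_{p \in O_{\bar{i}}} \|p - m'\|^2 = 2nR^2 + 2 \cdot \Phi_{m'}(O_{\bar{i}}) \\
&\overset{\text{Fact 1}}{=} 2nR^2 + 2 \cdot \left(\Delta(O_{\bar{i}}) + n^* \cdot \|m^* - m'\|^2\right) \\
&\overset{\text{Lemma 7}}{\le} 4nR^2 + 2 \cdot \Delta(O_{\bar{i}}).
\end{aligned}$$

This completes the proof of the lemma. □

We now argue that any center that is good for $O'_{\bar{i}}$ is also good for $O_{\bar{i}}$.

Lemma 9. Let $m''$ be such that $\Phi_{m''}(O'_{\bar{i}}) \le \left(1 + \frac{\epsilon}{16}\right) \cdot \Delta(O'_{\bar{i}})$. Then $\Phi_{m''}(O_{\bar{i}}) \le \left(1 + \frac{\epsilon}{2}\right) \cdot \Delta(O_{\bar{i}})$.

Proof. The lemma follows from the following inequalities:

$$\begin{aligned}
\Phi_{m''}(O_{\bar{i}}) &= \sum_{p \in O_{\bar{i}}} \|m'' - p\|^2 \overset{\text{Fact 1}}{=} \sum_{p \in O_{\bar{i}}} \|m^* - p\|^2 + n^* \cdot \|m^* - m''\|^2 \\
&\overset{\text{Fact 2}}{\le} \Delta(O_{\bar{i}}) + 2n^* \left(\|m^* - m'\|^2 + \|m' - m''\|^2\right) \overset{\text{Lemma 7}}{\le} \Delta(O_{\bar{i}}) + 2nR^2 + 2n^* \cdot \|m' - m''\|^2 \\
&\le \Delta(O_{\bar{i}}) + 2nR^2 + 2 \cdot \left(\Phi_{m''}(O'_{\bar{i}}) - \Delta(O'_{\bar{i}})\right) \le \Delta(O_{\bar{i}}) + 2nR^2 + \frac{\epsilon}{8} \cdot \Delta(O'_{\bar{i}}) \\
&\overset{\text{Lemma 8}}{\le} \Delta(O_{\bar{i}}) + 2nR^2 + \frac{\epsilon}{2} \cdot nR^2 + \frac{\epsilon}{4} \cdot \Delta(O_{\bar{i}}) \overset{\text{Lemma 6}}{\le} \left(1 + \frac{\epsilon}{2}\right) \cdot \Delta(O_{\bar{i}}).
\end{aligned}$$

This completes the proof of the lemma. □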
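The inequalities of Lemmas 6 to 9 are deterministic, so they can be sanity-checked numerically on arbitrary instances. The sketch below does this under stated assumptions: it uses the constants as reconstructed in this section ($R^2$ with the factor $\epsilon/41$, the bounds $16n/\epsilon$, $n/n^*$, $4nR^2 + 2\Delta$, and $1 + \epsilon/2$), it takes $m'' = m'$ as one admissible choice in Lemma 9, and all function names are hypothetical.

```python
import random

def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def cost(points, c):
    return sum(sqdist(p, c) for p in points)

def check_case2_lemmas(cluster, centers, eps):
    """Return truth values of Lemmas 6, 7, 8 and 9 (with m'' = m') for 0 < eps <= 1."""
    closest = lambda p: min(centers, key=lambda c: sqdist(p, c))
    n_star = len(cluster)
    r_sq = (eps / 41.0) * sum(sqdist(p, closest(p)) for p in cluster) / n_star
    near = [p for p in cluster if sqdist(p, closest(p)) <= r_sq]
    far = [p for p in cluster if sqdist(p, closest(p)) > r_sq]
    collapsed = far + [closest(p) for p in near]      # multi-set O'
    n = len(near)
    m_star, m_prime = mean(cluster), mean(collapsed)
    delta, delta_prime = cost(cluster, m_star), cost(collapsed, m_prime)
    tol = 1e-9 * (1.0 + delta + delta_prime)          # floating-point slack
    return (
        delta >= 16 * n / eps * r_sq - tol,                        # Lemma 6
        sqdist(m_star, m_prime) <= n / n_star * r_sq + tol,        # Lemma 7
        delta_prime <= 4 * n * r_sq + 2 * delta + tol,             # Lemma 8
        cost(cluster, m_prime) <= (1 + eps / 2) * delta + tol,     # Lemma 9
    )

rng = random.Random(0)
cluster = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(20)]
centers = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(3)]
print(check_case2_lemmas(cluster, centers, eps=0.5))
```

A check of this kind does not replace the proofs; it is only a quick way to catch a mis-transcribed constant.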


Given the above lemma, all we need to argue is that our algorithm indeed considers a center $m''$ such that $\Phi_{m''}(O'_{\bar{i}}) \le (1 + \epsilon/16) \cdot \Delta(O'_{\bar{i}})$. For this we would need about $O(1/\epsilon)$ uniform samples from $O'_{\bar{i}}$. However, our algorithm can only sample using $D^2$-sampling w.r.t. $C_i$. For ease of notation, let $c(O^n_{\bar{i}})$ denote the multi-set $\{c(p) : p \in O^n_{\bar{i}}\}$. Recall that $O'_{\bar{i}}$ consists of $O^f_{\bar{i}}$ and $c(O^n_{\bar{i}})$. The first observation is that the probability of sampling an element from $O^f_{\bar{i}}$ is reasonably large (proportional to $\epsilon/k$). Using this fact, we show how to sample from $O'_{\bar{i}}$ (almost uniformly). Finally, we show how to convert this almost uniform sampling to uniform sampling (at the cost of increasing the size of the sample).









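Lemma 10. Let $x$ be a sample from $D^2$-sampling w.r.t. $C_i$. Then, $\Pr[x \in O^f_{\bar{i}}] \ge \frac{\epsilon}{15k}$. Further, for any point $p \in O^f_{\bar{i}}$, $\Pr[x = p] \ge \frac{\gamma}{|O_{\bar{i}}|}$, where $\gamma$ denotes $\frac{\epsilon^2}{533k}$.

Proof. Note that $\sum_{p \in O_{\bar{i}} \setminus O^f_{\bar{i}}} \Pr[x = p] \le \frac{R^2}{\Phi_{C_i}(X)} \cdot |O_{\bar{i}}| = \frac{\epsilon}{41} \cdot \frac{\Phi_{C_i}(O_{\bar{i}})}{\Phi_{C_i}(X)}$. Therefore, the fact that we are in Case II implies that

$$\Pr[x \in O^f_{\bar{i}}] \ge \Pr[x \in O_{\bar{i}}] - \Pr[x \in O_{\bar{i}} \setminus O^f_{\bar{i}}] \ge \frac{\Phi_{C_i}(O_{\bar{i}})}{\Phi_{C_i}(X)} - \frac{\epsilon}{41} \cdot \frac{\Phi_{C_i}(O_{\bar{i}})}{\Phi_{C_i}(X)} \ge \frac{\epsilon}{15k}.$$

Also, if $x \in O^f_{\bar{i}}$, then $\Phi_{C_i}(\{x\}) \ge R^2 = \frac{\epsilon}{41} \cdot \frac{\Phi_{C_i}(O_{\bar{i}})}{|O_{\bar{i}}|}$. Therefore,

$$\frac{\Phi_{C_i}(\{x\})}{\Phi_{C_i}(X)} \ge \frac{\epsilon}{13k} \cdot \frac{R^2}{\Phi_{C_i}(O_{\bar{i}})} = \frac{\epsilon}{13k} \cdot \frac{\epsilon}{41} \cdot \frac{1}{|O_{\bar{i}}|} = \frac{\epsilon^2}{533k} \cdot \frac{1}{|O_{\bar{i}}|}.$$

This completes the proof of the lemma. □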
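For concreteness, the $D^2$-sampling distribution used in Lemma 10 can be computed exactly with rational arithmetic. The following is a minimal sketch under the assumption that $\Pr[x = p] = \Phi_{C_i}(\{p\}) / \Phi_{C_i}(X)$, as used in the proof above; the names `d2_distribution` and `d2_sample` are hypothetical.

```python
from fractions import Fraction
import random

def d2_distribution(points, centers):
    """Exact Pr[x = p] under D^2-sampling w.r.t. `centers`, one entry per point."""
    def sqdist(p, q):
        return sum((Fraction(a) - Fraction(b)) ** 2 for a, b in zip(p, q))

    # Weight of a point is its squared distance to the closest center.
    weights = [min(sqdist(p, c) for c in centers) for p in points]
    total = sum(weights)
    if total == 0:
        raise ValueError("every point coincides with a center")
    return [w / total for w in weights]

def d2_sample(points, centers, count, rng):
    """Draw `count` independent D^2-samples using the distribution above."""
    probs = d2_distribution(points, centers)
    return rng.choices(points, weights=[float(p) for p in probs], k=count)
```

For points 0, 1, 2, 3 on a line and a single center at 0, the weights are 0, 1, 4, 9, so far points are sampled much more often than near ones; this is exactly the non-uniformity that the collapsed set $O'_{\bar{i}}$ is designed to compensate for.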


Let $X_1, \ldots, X_l$ be $l$ points sampled independently using $D^2$-sampling w.r.t. $C_i$. We construct a new set of random variables $Y_1, \ldots, Y_l$. Each variable $Y_u$ will depend on $X_u$ only, and will take values either in $O'_{\bar{i}}$ or will be $\perp$. These variables are defined as follows: if $X_u \notin O^f_{\bar{i}}$, we set $Y_u$ to $\perp$. Otherwise, we assign $Y_u$ to one of the following random variables with equal probability: (i) $X_u$ or (ii) a random element of the multi-set $c(O^n_{\bar{i}})$. The following observation follows from Lemma 10.


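Corollary 2. For a fixed index $u$, and an element $x \in O'_{\bar{i}}$, $\Pr[Y_u = x] \ge \frac{\gamma'}{|O'_{\bar{i}}|}$, where $\gamma' = \gamma/2$.


Proof. If $x \in O^f_{\bar{i}}$, then we know from Lemma 10 that $X_u$ is $x$ with probability at least $\frac{\gamma}{|O'_{\bar{i}}|}$ (note that $O'_{\bar{i}}$ and $O_{\bar{i}}$ have the same cardinality). Conditioned on this event, $Y_u$ will be equal to $X_u$ with probability $1/2$. Now suppose $x \in c(O^n_{\bar{i}})$. Lemma 10 implies that $X_u$ is an element of $O^f_{\bar{i}}$ with probability at least $\frac{\epsilon}{15k}$. Conditioned on this event, $Y_u$ will be equal to $x$ with probability at least $\frac{1}{2} \cdot \frac{1}{|c(O^n_{\bar{i}})|}$. Therefore, the probability that $Y_u$ is equal to $x$ is at least $\frac{\epsilon}{15k} \cdot \frac{1}{2|c(O^n_{\bar{i}})|} \ge \frac{\gamma'}{|O'_{\bar{i}}|}$. □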


Corollary 2 shows that we can obtain samples from $O'_{\bar{i}}$ which are nearly uniform (up to a constant factor). To convert this to a set of uniform samples, we use the idea of [JKS14]. For an element $x \in O'_{\bar{i}}$, let $\gamma_x$ be such that $\frac{\gamma_x}{|O'_{\bar{i}}|}$ denotes the probability that the random variable $Y_u$ is equal to $x$ (note that this is independent of $u$). Corollary 2 implies that $\gamma_x \ge \gamma'$. We define a new set of independent random variables $Z_1, \ldots, Z_l$. The random variable $Z_u$ will depend on $Y_u$ only. If $Y_u$ is $\perp$, $Z_u$ is also $\perp$. If $Y_u$ is equal to $x \in O'_{\bar{i}}$, then $Z_u$ takes the value $x$ with probability $\frac{\gamma'}{\gamma_x}$, and $\perp$ with the remaining probability. Note that $Z_u$ is either $\perp$ or one of the elements of $O'_{\bar{i}}$. Further, conditioned on the latter event, it is a uniform sample from $O'_{\bar{i}}$. We can now prove the key lemma.


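Lemma 11. Let $l$ be $\frac{128}{\gamma' \epsilon}$, and let $m''$ denote the mean of the non-null samples from $Z_1, \ldots, Z_l$. Then, with probability at least $1/2$, $\Phi_{m''}(O'_{\bar{i}}) \le (1 + \epsilon/16) \cdot \Delta(O'_{\bar{i}})$.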


Proof. Note that a random variable $Z_u$ is equal to a specific element of $O'_{\bar{i}}$ with probability equal to $\frac{\gamma'}{|O'_{\bar{i}}|}$. Therefore, it takes the value $\perp$ with probability $1 - \gamma'$. Now consider a different set of iid random variables $Z'_u$, $1 \le u \le l$, as follows: each $Z'_u$ tosses a coin with probability of Heads being $\gamma'$. If we get Tails, it gets the value $\perp$; otherwise it is equal to a random element of $O'_{\bar{i}}$. It is easy to check that the joint distribution of the random variables $Z'_u$ is identical to that of the random variables $Z_u$. Thus, it suffices to prove the statement of the lemma for the random variables $Z'_u$.

Now we condition on the coin tosses of the random variables $Z'_u$. Let $n'$ be the number of random variables which are not $\perp$ ($n'$ is a deterministic quantity because we have conditioned on the coin tosses). Let $m''$ be the mean of such non-$\perp$ variables among $Z'_1, \ldots, Z'_l$. If $n'$ happens to be larger than $64/\epsilon$, Lemma 1 implies that with probability at least $3/4$, $\Phi_{m''}(O'_{\bar{i}}) \le (1 + \epsilon/16) \cdot \Delta(O'_{\bar{i}})$.

Finally, observe that the expected number of non-$\perp$ random variables is $\gamma' \cdot l \ge 128/\epsilon$. Therefore, with probability at least $3/4$, the number of non-$\perp$ elements will be at least $64/\epsilon$. □

Let $C_i^{(l)}$ denote the multi-set obtained by taking $l$ copies of each of the centers in $C_i$. Now observe that all the non-$\perp$ elements among $Y_1, \ldots, Y_l$ are elements of $\{X_1, \ldots, X_l\} \cup C_i^{(l)}$, and so the same must hold for $Z_1, \ldots, Z_l$. This implies that in Step 2(d) of the algorithm Sample-centers, we would have tried adding the point $m''$ as described in Lemma 11. Therefore, the induction hypothesis continues to hold with probability at least $1/2$. This concludes the proof of Theorem 1.


4 Lower Bound 


In this section, we prove the lower bound result, Theorem 2. Consider parameters $k$ and $\epsilon$ (assume $\epsilon$ is a small enough constant). We first define the set of points $X$. Let $m$ denote $1/\sqrt{\epsilon}$. The points will belong to $\mathbb{R}^d$, where $d = km$. The set $X$ will have $d$ points, namely $e_1, \ldots, e_d$, where $e_j$ denotes the vector which has all coordinates 0, except for the $j$-th coordinate, which is 1. Now, we define the set $\mathcal{C}$ of clusterings of $X$. The set $\mathcal{C}$ will consist of those clusterings $O = \{O_1, \ldots, O_k\}$ for which each of the clusters has exactly $m$ points. Observe that

$$|\mathcal{C}| = \frac{(km)!}{(m!)^k}. \qquad (5)$$


Now fix a set $C$ of $k$ centers, $c_1, \ldots, c_k$. We will now upper bound the number of clusterings $O \in \mathcal{C}$ for which

$$\text{cost}_C(O) \le (1 + \epsilon) \cdot \text{opt}_k(O). \qquad (6)$$


Let $O = \{O_1, \ldots, O_k\}$ be as above. Note that

$$\text{opt}_k(O) = \sum_{i=1}^{k} \Delta(O_i) = km \cdot \left((1 - 1/m)^2 + (m - 1) \cdot 1/m^2\right) = k(m - 1). \qquad (7)$$
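Equation (7) is easy to confirm by direct computation on a small instance. The sketch below assumes the clustering that groups the unit vectors into consecutive blocks of size $m$ (by symmetry, every clustering in the family has the same cost); the function name `unit_vector_opt` is hypothetical.

```python
from fractions import Fraction

def unit_vector_opt(k, m):
    """Sum of optimal 1-means costs for k clusters of m unit vectors each."""
    d = k * m
    total = Fraction(0)
    for r in range(k):
        members = range(r * m, (r + 1) * m)
        # The mean of a cluster has 1/m on its own coordinates, 0 elsewhere.
        mean = [Fraction(1, m) if j in members else Fraction(0) for j in range(d)]
        for j in members:
            total += sum(((1 if t == j else 0) - mean[t]) ** 2 for t in range(d))
    return total

print(unit_vector_opt(3, 4) == 3 * (4 - 1))
```

Each point contributes $(1 - 1/m)^2 + (m-1)/m^2 = (m-1)/m$, so the $km$ points together contribute $k(m-1)$.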


Recall that $\text{cost}_C(O)$ is obtained by assigning each cluster in $O$ to a unique center in $C$, and then computing the sum of squares of distances of points in $X$ to the corresponding centers. Wlog we rearrange the clusters in $O$ such that the points in $O_r$ are assigned to $c_r$. For a vector $v$, we shall use $(v)_j$ to denote the $j$-th coordinate of $v$. For every center $c_r$ we define a corresponding vector $v_r$ as follows:

$$(v_r)_j = \begin{cases} (c_r)_j & \text{if } e_j \notin O_r \\ (c_r)_j - \frac{1}{m} & \text{otherwise} \end{cases}$$


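Lemma 12. $\sum_{r=1}^{k} \|v_r\|^2 \le \frac{k}{m(m-1)}$.

Proof. Fix a cluster $O_r$. Let $m_r$ denote the mean of $O_r$. Note that $(m_r)_j$ is $1/m$ if $e_j \in O_r$, and 0 otherwise; hence $v_r = c_r - m_r$. We now simplify the expression $\text{cost}_C(O)$ as follows:

$$\text{cost}_C(O) = \sum_{r=1}^{k} \sum_{e_j \in O_r} \|e_j - c_r\|^2 = \sum_{r=1}^{k} \sum_{e_j \in O_r} \left(\|e_j - m_r\|^2 + \|m_r - c_r\|^2\right) = \text{opt}_k(O) + m \sum_{r=1}^{k} \|m_r - c_r\|^2 = \text{opt}_k(O) + m \sum_{r=1}^{k} \|v_r\|^2.$$

By our assumption, $\text{cost}_C(O) \le (1 + \epsilon) \cdot \text{opt}_k(O)$. Therefore

$$\sum_{r=1}^{k} \|v_r\|^2 \le \frac{\epsilon}{m} \cdot \text{opt}_k(O) = \frac{\epsilon}{m} \cdot k(m - 1) \le \frac{k}{m(m-1)}.$$

□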
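The identity $\text{cost}_C(O) = \text{opt}_k(O) + m \sum_r \|v_r\|^2$ used in the proof of Lemma 12 holds for any choice of centers, which makes it convenient to check with exact arithmetic. The sketch below uses randomly chosen rational centers; the function name and the block-structured clustering are assumptions made for the illustration, and cluster $r$ is assigned to center $r$ as in the "wlog" step above.

```python
from fractions import Fraction
import random

def check_decomposition(k, m, seed=0):
    """Return (cost_C(O), opt_k(O) + m * sum ||v_r||^2) for random centers."""
    rng = random.Random(seed)
    d = k * m
    clusters = [range(r * m, (r + 1) * m) for r in range(k)]
    centers = [[Fraction(rng.randint(-3, 3), 7) for _ in range(d)] for _ in range(k)]
    cost = opt = v_norm = Fraction(0)
    for members, c in zip(clusters, centers):
        mean = [Fraction(1, m) if j in members else Fraction(0) for j in range(d)]
        for j in members:
            e = [1 if t == j else 0 for t in range(d)]
            cost += sum((e[t] - c[t]) ** 2 for t in range(d))
            opt += sum((e[t] - mean[t]) ** 2 for t in range(d))
        # v_r = c_r - m_r, as follows from the definition of v_r.
        v_norm += sum((c[t] - mean[t]) ** 2 for t in range(d))
    return cost, opt + m * v_norm

lhs, rhs = check_decomposition(3, 4)
print(lhs == rhs)
```

The two returned values agree exactly because the cross term in the expansion around the mean vanishes.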

Now define a corresponding assignment function $f : X \to \{1, \ldots, k\}$ as follows: $f(e_j) = r$ if $e_j \in O_r$. Let $O' = \{O'_1, \ldots, O'_k\}$ be another clustering in $\mathcal{C}$ which satisfies condition (6). Define vectors $v'_r$ and the assignment function $f'$ in a similar manner. The following lemma shows that $f$ and $f'$ cannot differ in too many coordinates.

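Lemma 13. Let $D$ denote the set of indices $j$ for which $f(e_j) \ne f'(e_j)$. Then $|D| \le d/2$.

Proof. Assume for the sake of contradiction that $|D| > d/2$. For a cluster $O_r$, let $D_r$ denote the set of indices $j$ such that $e_j \in O_r \,\triangle\, O'_r$. Observe that for $j \in D_r$, $(v_r)_j$ and $(v'_r)_j$ differ (in absolute value) by $1/m$. Therefore,

$$\sum_{j \in D_r} (v'_r)_j^2 \ge \sum_{j \in D_r} \left(\frac{1}{m} - |(v_r)_j|\right)^2 \ge \frac{|D_r|}{m^2} - \frac{2}{m} \sum_{j \in D_r} |(v_r)_j|.$$

Summing over $r = 1, \ldots, k$, we get

$$\sum_{r=1}^{k} \|v'_r\|^2 \ge \frac{2|D|}{m^2} - \frac{2}{m} \sum_{r=1}^{k} \sum_{j \in D_r} |(v_r)_j| \ge \frac{2|D|}{m^2} - \frac{2}{m} \sqrt{2|D|} \cdot \sqrt{\sum_{r=1}^{k} \sum_{j \in D_r} (v_r)_j^2},$$

where the last inequality follows from Cauchy-Schwarz, and the observation that $\sum_r |D_r| = 2|D| > d$. Using Lemma 12 we see that

$$\sum_{r=1}^{k} \|v'_r\|^2 \ge \frac{k}{m} - \frac{2}{m} \sqrt{2d} \cdot \sqrt{\sum_{r=1}^{k} \|v_r\|^2} \ge \frac{k}{m} - \frac{4k}{m\sqrt{m-1}} > \frac{k}{m(m-1)},$$

assuming $m$ is a large enough constant. But this contradicts Lemma 12. □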


The above lemma shows that the number of clusterings in $\mathcal{C}$ satisfying condition (6) is small.

Corollary 3. The number of clusterings in $\mathcal{C}$ satisfying condition (6) is at most $\binom{km}{km/2} \cdot \frac{(km/2)!}{((m/2)!)^k}$.















Proof. Fix a clustering $O = \{O_1, \ldots, O_k\}$ satisfying condition (6), and let $f$ be the corresponding assignment function. How many assignment functions (corresponding to a clustering in $\mathcal{C}$) can differ from $f$ in at most $d/2$ coordinates? There are at most $\binom{d}{d/2}$ ways of choosing the coordinates in which the two functions differ. Consider a fixed choice of such coordinates, and say there are $d_r$ coordinates corresponding to points in $O_r$. Let $d'$ denote $\sum_r d_r$ (and so, $d' \le d/2$). Now, we need to partition these coordinates into sets of size $d_1, \ldots, d_k$ (note that $f'$ corresponds to a clustering where all clusters are of equal size). The number of possibilities here is $\frac{d'!}{d_1! \cdots d_k!}$, which is at most $\frac{(d/2)!}{((d/2k)!)^k}$. □


Recall that we want $\mathcal{L}$ to contain enough elements such that for at least half of the clusterings in $\mathcal{C}$, condition (6) is satisfied with respect to some set of centers in $\mathcal{L}$. Therefore, Corollary 3 and (5) imply that

$$|\mathcal{L}| \ge \frac{1}{2} \cdot \frac{(km)!/(m!)^k}{\binom{km}{km/2} \cdot (km/2)!/((m/2)!)^k} = 2^{\Omega(km)} = 2^{\tilde{\Omega}(k/\sqrt{\epsilon})}.$$

This concludes the proof of Theorem 2.
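To get a feel for the size of this bound, the ratio can be evaluated exactly for small parameters. The sketch below computes the base-2 logarithm of the counting bound in the form reconstructed above (including the factor $1/2$); it assumes $m$ is even so that $m/2$ and $km/2$ are integers, and `log2_list_lower_bound` is a hypothetical name.

```python
import math

def log2_list_lower_bound(k, m):
    """log2 of (1/2) * |C| / (binom(km, km/2) * (km/2)! / ((m/2)!)^k)."""
    if m % 2:
        raise ValueError("this sketch assumes m is even")
    d = k * m
    # Number of clusterings with k clusters of size m, equation (5).
    num_clusterings = math.factorial(d) // math.factorial(m) ** k
    # Corollary 3: clusterings that a single set of k centers can serve.
    per_center_set = (math.comb(d, d // 2) * math.factorial(d // 2)
                      // math.factorial(m // 2) ** k)
    return math.log2(num_clusterings) - math.log2(per_center_set) - 1

for k in (8, 16, 32):
    print(k, round(log2_list_lower_bound(k, 8), 1))
```

For fixed $m$ the exponent grows with $k$, in line with the $2^{\Omega(km)}$ form of the bound for large enough $k$.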


5 Extension to the list k-median problem

The setting for the list $k$-median problem is the same as that for the list $k$-means problem, except that distances are measured using the Euclidean norm (instead of the square of the Euclidean norm). As before, for a set $C$ of $k$ centers, and a clustering $O = \{O_1, \ldots, O_k\}$ of a set of points $X$, define $\text{cost}_C(O)$ as the minimum, over all permutations $\pi$ of $C$, of $\sum_{i=1}^{k} \sum_{x \in O_i} \|x - c_{\pi(i)}\|$. Define $\text{opt}_k(O)$ and $\Phi_C(X)$ analogously. For a set of points $X$, let $\Delta(X)$ denote the optimal 1-median cost of $X$, i.e., $\min_{c \in \mathbb{R}^d} \sum_{x \in X} \|x - c\|$. We no longer have an analogue of Fact 1: for a set of points $X$, if $c^*$ denotes the optimal center with respect to the 1-median objective, and $c$ is a point such that $\Phi_c(X) \le (1 + \epsilon) \cdot \Phi_{c^*}(X)$, it is possible that $\|c - c^*\|$ is large. This also implies that there is no analogue of Lemma 1. However, instead of the approximate triangle inequality (Fact 2), we get the triangle inequality in the Euclidean metric.
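A two-point example makes the missing centroid property concrete. In the sketch below (function names are hypothetical, and Fact 1 is assumed to be the standard identity $\Phi_c(X) = \Delta(X) + |X| \cdot \|c - \mu\|^2$ for the mean $\mu$), both endpoints of the segment are optimal 1-median centers even though they are far apart, whereas for 1-means the cost grows with the squared distance from the mean.

```python
def one_median_cost(points, c):
    """Sum of distances of 1-D points to the center c."""
    return sum(abs(x - c) for x in points)

def one_means_cost(points, c):
    """Sum of squared distances of 1-D points to the center c."""
    return sum((x - c) ** 2 for x in points)

X = [0.0, 1.0]
mu = sum(X) / len(X)

# 1-median: every c in [0, 1] has the same (optimal) cost.
print(one_median_cost(X, 0.0), one_median_cost(X, 0.5), one_median_cost(X, 1.0))

# 1-means: the cost at c exceeds the optimum by |X| * (c - mu)^2.
print(one_means_cost(X, 0.0), one_means_cost(X, mu) + len(X) * (0.0 - mu) ** 2)
```

So a center whose 1-median cost is optimal can still be at distance 1 from another optimal center, which is why the analysis below relies on the exact triangle inequality instead.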

We shall use a result of Kumar et al. [KSS10], which gives an alternative to Lemma 1, although it outputs several candidate centers instead of just the mean of a random sample.

Lemma 14 (Theorem 5.4 |KSS10] L Given a random sample (with replacement) R of size ^ 
from a set of points X G there is a procedure construct(R), which outputs a set core(ii) of 
size such that the following event happens with probability at least 1/2 .• there is at least one 

point c G core(ii) such that ^c(^) < (1 + ^)' ^i^)- The time taken by the procedure construct(ii) 

. d 


is O 

Now we explain the changes needed in the algorithm and the analysis. Given a set of points $X$ and another set of points $C$, $D$-sampling from $X$ w.r.t. $C$ samples a point $x \in X$ with probability proportional to $\Phi_C(\{x\})$, i.e., $\min_{c \in C} \|c - x\|$.
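A minimal sketch of $D$-sampling as just defined is given below. The function name `d_sample` is hypothetical, and the behaviour for an empty center set (uniform sampling) is an assumption made here for the first level of the recursion, not something stated in the text.

```python
import math
import random

def d_sample(points, centers, count, rng):
    """Draw `count` points with probability proportional to the distance
    (not squared) to the closest center."""
    if centers:
        weights = [min(math.dist(p, c) for c in centers) for p in points]
    else:
        # Assumption: with no centers chosen yet, sample uniformly.
        weights = [1.0] * len(points)
    if sum(weights) == 0:
        raise ValueError("every point coincides with a center")
    return rng.choices(points, weights=weights, k=count)

rng = random.Random(0)
pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(d_sample(pts, [(0.0, 0.0)], 5, rng))
```

In this example the weights are 0, 5 and 10, so the point sitting on the center is never returned and the farthest point is twice as likely as the middle one.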


5.1 The algorithm 

The algorithm is the same as that in Figure 2.1, except for some minor changes in the procedure Sample-centers, and changes in the values of the various parameters. The parameters $\alpha$ and $\beta$ in the procedure List-$k$-median are large enough constants. We briefly describe the changes in the procedure Sample-centers. In Step 2(a), we sample the multi-set $S$ using $D$-sampling w.r.t. $C$. We replace Step 2(d) by the following: for all subsets $T \subset S'$ of size $M$, and for all elements $c \in \text{core}(T)$: (i) $C \leftarrow C \cup \{c\}$, (ii) Sample-centers($X, k, \epsilon, i + 1, C$). Recall that $\text{core}(T)$ is the set guaranteed by Lemma 14. In other words, unlike in the $k$-means setting, where we could just work with the mean of $T$, we now need to try out all the elements in $\text{core}(T)$. Figure 5.1 gives a detailed description of the algorithm.


List-k-median($X, k, \epsilon$)

_ Let AT = M = |r, £ ^ 0. 

- Repeat $2^k$ times:

- Make a call to Sample-centers($X, k, \epsilon, 0, \{\}$) and output the union of lists returned by these calls.

- Return $\mathcal{L}$.

Sample-centers($X, k, \epsilon, i, C$)

(1) If ($i = k$) then add $C$ to $\mathcal{L}$.

(2) else

(a) Sample a multi-set $S$ of $N$ points with $D$-sampling (w.r.t. centers $C$)

(b) $S' \leftarrow S$

(c) For all $c \in C$: $S' \leftarrow S' \cup \{M \text{ copies of } c\}$

(d) For all subsets $T \subset S'$ of size $M$ and for all elements $c \in \text{core}(T)$:

(i) $C \leftarrow C \cup \{c\}$.

(ii) Sample-centers($X, k, \epsilon, i + 1, C$)


Algorithm 5.1. Algorithm for list k-median.
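The recursive structure of the procedure can be sketched as follows. This is an illustrative skeleton under several assumptions, not the authors' implementation: the parameters `n_samples`, `m_size` and `repeats` stand in for $N$, $M$ and the number of repetitions and must be supplied by the caller; `core` is a user-provided callable standing in for the construct procedure of Lemma 14; sampling is uniform when no center has been chosen yet; and each branch extends its own copy of $C$, which is how the step $C \leftarrow C \cup \{c\}$ is read here.

```python
import itertools
import math
import random

def list_k_median(points, k, n_samples, m_size, core, repeats, seed=0):
    """Enumerate candidate lists of k centers (hypothetical skeleton)."""
    rng = random.Random(seed)
    output = []

    def sample_centers(i, centers):
        if i == k:
            output.append(list(centers))
            return
        # Step 2(a): D-sampling w.r.t. the current centers.
        if centers:
            weights = [min(math.dist(p, c) for c in centers) for p in points]
        else:
            weights = [1.0] * len(points)      # assumption: uniform at the start
        if sum(weights) == 0:
            weights = [1.0] * len(points)      # all points already covered
        sample = rng.choices(points, weights=weights, k=n_samples)
        # Steps 2(b)-(c): add m_size copies of every current center.
        s_prime = sample + [c for c in centers for _ in range(m_size)]
        # Step 2(d): try every subset of size m_size and every core element.
        for subset in itertools.combinations(s_prime, m_size):
            for c in core(subset):
                sample_centers(i + 1, centers + [c])

    for _ in range(repeats):
        sample_centers(0, [])
    return output

# Toy run: `core` simply returns the first point of the subset.
pts = [(0.0,), (1.0,), (5.0,), (6.0,)]
lists = list_k_median(pts, k=2, n_samples=3, m_size=2,
                      core=lambda t: [t[0]], repeats=1)
print(len(lists))
```

The number of candidate lists grows as the product, over the $k$ levels, of the number of subsets times the size of the core set, which is the source of the exponential dependence on $k$ in the list size.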


5.2 Analysis 

The analysis proceeds along the same lines as in Section 3, and we would again like to prove the induction hypothesis $P(i)$. We use the same notation as in Section 3, and define Cases I and II analogously. Consider Case I first. The proof of Lemma 2 remains unchanged. The set $O'_{\bar{i}}$ is defined similarly. Let $m^*$ be the point for which $\Delta(O_{\bar{i}}) = \Phi_{m^*}(O_{\bar{i}})$. Define $m'$ analogously for the set $O'_{\bar{i}}$. The statement of Lemma 4 now changes as follows:

$$\Delta(O'_{\bar{i}}) \le \sum_{p \in O_{\bar{i}}} \|c(p) - m'\| \le \sum_{p \in O_{\bar{i}}} \|c(p) - m^*\| \le \sum_{p \in O_{\bar{i}}} \left(\|c(p) - p\| + \|p - m^*\|\right) = \Phi_{C_i}(O_{\bar{i}}) + \Delta(O_{\bar{i}}). \qquad (8)$$


Proof of Lemma [5] also changes as follows: let m" be as in the statement of this lemma. Then, 

^ (IIp-c(p)II+ ||c(p)-m"||) 
psO? P^^J 

® / F\ Lemma [2] p / f\ 

< 2 . + (l + s) • ^ + (^ + s) ■ 

The rest of the arguments remain unchanged (we use Lemma 14 instead of Lemma 1). Now we consider Case II. We redefine the parameter $R$ as

$$R = \frac{\epsilon}{9} \cdot \frac{\Phi_{C_i}(O_{\bar{i}})}{|O_{\bar{i}}|}.$$

Define the sets $O'_{\bar{i}}$, $c(O^n_{\bar{i}})$, $O^f_{\bar{i}}$ as before. Let $m^*$ be the point for which $\Delta(O_{\bar{i}}) = \Phi_{m^*}(O_{\bar{i}})$, and $m'$ be the analogous point for $O'_{\bar{i}}$. The proof of Lemma 6 can be easily modified to yield the following (instead of Fact 1 we just need to use the triangle inequality):


A{0^) = ^ 


(9) 


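We have the following version of Lemma 8:

$$\Delta(O'_{\bar{i}}) \le \sum_{p \in O^n_{\bar{i}}} \|c(p) - m^*\| + \sum_{p \in O^f_{\bar{i}}} \|p - m^*\| \le \sum_{p \in O^n_{\bar{i}}} \left(\|p - m^*\| + \|c(p) - p\|\right) + \sum_{p \in O^f_{\bar{i}}} \|p - m^*\| \le nR + \Delta(O_{\bar{i}}), \qquad (10)$$

where $n$ denotes $|O^n_{\bar{i}}|$. Finally, let $m''$ be as in the statement of Lemma 9. Then,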

= Y \\P ~ + Y \\P ' 


— m 


P^Ol 

•* I 

< Y {Mp)-^”\\ + Mp)- p\\) + Y Wp-^" 


p&o^ 


P&04 


< nR + {Oj) A: nR + ^1 + —^ • Z\(0^) 

(Uni / ex (M 

< 3ni? + +-j • Z\(Oj) < (1 + e) • Z\(0^). 


( 10 ) 


( 11 ) 


Rest of the arguments go through without any changes. 


6 Conclusion 

We formulated the list $k$-means problem and gave nearly tight upper and lower bounds on the size of the list of candidate centers. We also obtained an algorithm for the constrained $k$-means problem, a significant improvement over the previous results of Ding and Xu [DX15]. Furthermore, we showed how our techniques generalize to the corresponding $k$-median problems. We would also like to point out that our techniques generalize to settings that involve non-Euclidean distance measures. After going through the analysis of our algorithm, it is not difficult to show that the only properties that are used in the analysis are:

(i) Symmetry of the distance measure (used implicitly)

(ii) (Approximate) Triangle Inequality: Fact 2

(iii) Centroid property: Fact 1

(iv) Sampling property: Lemma 1

The analysis holds even for some approximate versions of the above properties. For instance, for the $k$-median problem we were able to use Lemma 14 instead of Lemma 1 (i.e., the sampling property). Also, we were able to work without the centroid property since for the $k$-median problem the distances follow the exact triangle inequality instead of the approximate version (i.e., Fact 2). We note that there are a number of clustering problems in machine learning that are modeled as the $k$-median problem over distance measures that follow the above properties in some approximate sense. Mahalanobis distance and $\mu$-similar Bregman divergence are two examples of such distance measures. Our results can be very easily extended to the $k$-median problem over such distance measures.
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